1 Intro and Method

This notebook contains all code for parsing and analysis the protein clusters generated using MMSeq2.

2 Analysing the BLASTP all-versus-all results

A representative sequence from each of the 4 largest clusters compiled using MMSeq2 (with a percentage identity and coverage of cut-off of 70%) was extracted from a local CAZyme database using cazy_webscraper.

Each representative sequence was identified by using the GenBank accession assigned as the name of the cluster by MMSeq2.

The 4 representative sequences were analysed using BLASTP. The Python script run_blastp.py from the Python package pyrewton DOI:10.5281/zenodo.3876218) was used to run the BLASTP all-versus-all analysis.

The table compiled by BLASTP contains the following columns: - qseqid - sseqid - pident - length - mismatch - gapopen - qstart - qend - sstart - send - evalue - bitscore

In order to compare each of the pair-wise-alignment we need to calculate the Blast Score Ratio (SCR) to normalise for length.

The bitscore reported by BLAST is the sum of the qualities of the aligned symbols over the whole alignment. This is an accurate measure of the alignment strength, but long sequences tend to have higher bitscores than short sequences, even when the matches are of about the same quality. To correct for this length effect, we can calculate a normalised bitscore where:

normalised bitscore = bitscore / query length

Table 2.1 presents the raw output from BLASTP as well as the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.

Table 2.1: Output from BLASTP all-vs-all analysis of the representative sequences from the 4 larges protein clusters
qseqid sseqid pident cov qlen slen alen bitscore evalue bsr
CBK69950.1 CBK69950.1 100.000 100 539 539 539 1121 0 2.0797774
CBK69950.1 AGE22437.1 49.171 100 539 567 543 520 0 0.9647495
CBK69950.1 CDG29680.1 49.631 99 539 533 542 520 0 0.9647495
CBK69950.1 QJR11213.1 44.383 99 539 534 543 412 0 0.7643785
QJR11213.1 QJR11213.1 100.000 100 534 534 534 1107 0 2.0730337
QJR11213.1 CDG29680.1 47.532 99 534 533 547 477 0 0.8932584
QJR11213.1 AGE22437.1 49.358 99 534 567 545 473 0 0.8857678
QJR11213.1 CBK69950.1 44.954 99 534 539 545 417 0 0.7808989
CDG29680.1 CDG29680.1 100.000 100 533 533 533 1118 0 2.0975610
CDG29680.1 AGE22437.1 66.417 99 533 567 533 753 0 1.4127580
CDG29680.1 CBK69950.1 49.631 99 533 539 542 520 0 0.9756098
CDG29680.1 QJR11213.1 47.091 99 533 534 550 471 0 0.8836773
AGE22437.1 AGE22437.1 100.000 100 567 567 567 1179 0 2.0793651
AGE22437.1 CDG29680.1 66.417 94 567 533 533 753 0 1.3280423
AGE22437.1 CBK69950.1 49.171 94 567 539 543 520 0 0.9171076
AGE22437.1 QJR11213.1 49.088 93 567 534 548 469 0 0.8271605

Figure 2.1 is an interactive plot presenting the BSR from the all-verus-all BLASTP analysis of the representative sequences from the 4 largest clusters compiled by MMSeq2.

To view the specific BSR for each comparison, hover over the plot and a tooltip will appear and will present the GenBank accessions of the corresponding proteins as well as the specific BSR value (to 3dp).

Figure 2.1: One-dimensional scatter plot of specificity scores of CAZyme and non-CAZyme predictions per test set, overlaying box plot of standard deviation.